fn:parse-html and Validator.nu dependency #2258

Merged on Jan 24, 2025 (17 commits)


GuntherRademacher
Member

These changes

  • replace the dependency on TagSoup with a dependency on Validator.nu's HTML parser,
  • replace the TagSoup options with Validator.nu options,
  • use Validator.nu to implement html:parse,
  • add the function fn:parse-html, based on html:parse (but, unlike html:parse, without falling back to the XML parser when Validator.nu is unavailable).

Some slight restructuring of StandardFunc and FuncOptions was necessary to be able to produce error FODC0012.

@GuntherRademacher
Member Author

GuntherRademacher commented Jan 22, 2025

The latest changes to this PR re-integrate TagSoup, which remains the default HTML parser for everything except fn:parse-html; the latter uses Validator.nu by default.

HtmlOptions integrates the options from both TagSoup and Validator.nu. For fn:parse-html, html:parse, and the other users of HtmlParser that supply HtmlOptions, the method option selects the parser: TagSoup for the option values xml and html, and Validator.nu for the option value nu.
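For readers who want the selection rule spelled out, here is a minimal Java sketch of the mapping described above. It is not the code from this PR: the class `ParserSelection`, the enum `Parser`, and the method `select` are hypothetical names, and rejecting unknown option values with an exception is an assumption on my side. The fallback argument models the per-caller default (Validator.nu for fn:parse-html, TagSoup elsewhere).

```java
/** Hypothetical sketch; not the actual HtmlOptions/HtmlParser code from this PR. */
public final class ParserSelection {
  /** The two HTML parsers discussed in this PR. */
  public enum Parser { TAGSOUP, VALIDATOR_NU }

  private ParserSelection() { }

  /**
   * Chooses a parser from the value of the method option.
   * @param method option value ("xml", "html" or "nu"), or null if the option is absent
   * @param fallback caller-specific default used when no method is given
   * @return selected parser
   */
  public static Parser select(final String method, final Parser fallback) {
    if (method == null) return fallback;
    switch (method) {
      case "xml":
      case "html":
        return Parser.TAGSOUP;
      case "nu":
        return Parser.VALIDATOR_NU;
      default:
        // Assumption: unknown values are rejected; the PR does not state this.
        throw new IllegalArgumentException("Unknown method: " + method);
    }
  }

  public static void main(final String[] args) {
    // fn:parse-html without a method option: Validator.nu is the default.
    System.out.println(select(null, Parser.VALIDATOR_NU));
    // html:parse with method=nu: Validator.nu is selected explicitly.
    System.out.println(select("nu", Parser.TAGSOUP));
    // fn:parse-html with method=html: TagSoup overrides the default.
    System.out.println(select("html", Parser.VALIDATOR_NU));
  }
}
```

The point of passing the default in as an argument is that the same option handling can serve both fn:parse-html and html:parse, with only the default differing between the callers.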

Of the 1379 QT4 tests for fn:parse-html, 1367 now pass and 12 still fail. The failing tests still need to be inspected in more detail.

@ChristianGruen ChristianGruen merged commit e30a80e into BaseXdb:main Jan 24, 2025
@ChristianGruen ChristianGruen deleted the parse-html branch January 24, 2025 13:26
@ChristianGruen
Member

ChristianGruen commented Jan 24, 2025

Thanks, everything was fine. Next, we could tackle the options of fn:parse-html and decide which ones we want to support and which ones we want to discuss in the qtspecs repository. For example, I am not sure whether encoding should be retained, but it could be useful for the still-to-be-added fn:html-doc function.

@GuntherRademacher GuntherRademacher restored the parse-html branch January 28, 2025 16:58